Optical switching network

ABSTRACT

Systems and methods are disclosed for communicating over an optical network by using hop-by-hop routing over the optical network and dynamically constructing a network topology.

The present application claims priority to Provisional Application Ser. No. 61/362,482, filed Jul. 8, 2010, and 61/436,283, filed on Jan. 26, 2011, the contents of which are incorporated by reference.

BACKGROUND

The present invention relates to an optical switching network.

Two key challenges faced by existing data center network (DCN) architectures are (a) balancing the demand for high bandwidth connectivity between all pairs of servers with the associated high cost, and (b) having the flexibility to support a variety of applications and their traffic demand.

Many online services, such as those offered by Amazon, Google, Facebook, and eBay, are powered by massive data centers hosting tens to hundreds of thousands of servers. The network interconnect of the data center plays a key role in the performance and scalability of these services. As application traffic and the number of hosted applications grow, the industry is constantly looking for larger server-pools, higher bit-rate network-interconnects, and smarter workload placement approaches to effectively utilize the network resources. To meet these goals, a careful examination of traffic characteristics, operator requirements, and network technology trends is critical.

High bandwidth, static network connectivity between all server pairs ensures that the network can support an arbitrary application mix. However, static network topologies that provide such connectivity tend to be quite expensive (in terms of both the startup as well as recurring costs), and cannot scale beyond a certain number of interconnected servers. Further, for many applications, all-to-all connectivity at all times is not needed, and hence static network connectivity can be quite wasteful in these cases. Finally, such topologies also suffer from the need to “re-wire” the network to support greater network bandwidth demands from future applications.

Existing DCN architecture proposals attempt to address these challenges by using a hybrid approach that combines small-scale, all-to-all connectivity using electrical interconnects with alternative data transmission technologies (e.g. high-speed wireless or optical switching) that provide flexibility in terms of adapting to traffic demands. In these approaches, the workload is split between the electrical and optical network paths such that peak traffic is offloaded to the extra paths (which could be wireless, optical, or electrical). This use of optical or wireless transmission technologies as an add-on, as opposed to a fundamental component of the architecture, limits the applicability of these solutions to today's network traffic patterns and bandwidth demands: the base network topology is not flexible and is built on the assumption that average traffic patterns are known in advance. In addition, these solutions also suffer from the need to re-wire the electrical network to support higher throughputs.

SUMMARY

In one aspect, systems and methods are disclosed for communicating over an optical network by using hop-by-hop routing over the optical network and dynamically constructing a network topology.

In one aspect, a method to communicate over an optical network includes dynamically constructing a network topology based on traffic demands and hop-by-hop routing; and constructing a dynamically changing data center network (DCN) architecture.

In another aspect, a method for interconnecting a data center network includes using hop-by-hop routing over an optical network.

In yet another aspect, a method for interconnecting a data center network includes using hop-by-hop routing over an optical network; and using bidirectional optical network devices to enable bidirectional communication over fiber.

In a further aspect, a method for interconnecting a data center network includes using hop-by-hop routing over an optical network; using bidirectional optical network devices to enable bidirectional communication over fiber; and dynamically constructing a network topology.

In yet another aspect, a method for interconnecting a data center with an optical network includes using bidirectional optical network devices to enable bidirectional communication over fiber.

Advantages of the preferred embodiment may include one or more of the following. The system is the first-ever all-optical switching architecture for data center networks (DCNs). By exploiting runtime reconfigurable optical devices, the system can dynamically change network topology as well as link capacities, thus achieving unprecedented flexibility to adapt to different traffic patterns.

The system addresses these drawbacks of static network topologies by providing a dynamic DCN architecture that can adapt to application traffic demands in an efficient manner while also supporting high bandwidth server-to-server connectivity. The key feature is that it allows any subset of servers to be connected at full bandwidth in an on-demand manner without requiring static, all-to-all full bandwidth connectivity.

The preferred embodiment can adapt the network topology based on application traffic demands, while also supporting high bandwidth connectivity between any subset of servers. To accomplish these challenging tasks, the system uses three basic building blocks: (1) an innovative placement of optical devices, (2) algorithms for adaptive network reconfiguration (Procedure 2(a), 2(b), 3, and 5) based on traffic demand dynamics, and (3) hop-by-hop routing (Procedure 6).

The innovative placement of optical devices allows this preferred embodiment to use re-configurable optical paths. This enables the system to be flexible in terms of path and capacity assignment between the servers. Exactly how these paths are re-configured to interconnect servers, as well as the capacity of each path, is controlled by our adaptive network re-configuration algorithms. By extensively using optical fibers that have the ability to support higher bandwidths simply by adding wavelengths, higher throughputs can be supported without re-wiring. As Proteus does not impose the requirement of underlying all-to-all electrical connectivity between the servers, and due to the physical limitation on the number of possible optical paths between servers, the inclusion of hop-by-hop routing is necessary in our design. The intuition here is that if a direct optical path does not exist, a hop-by-hop path can be used instead. For this purpose, we include a multi-hop routing protocol that uses source-routing.

Other advantages of the preferred embodiment may include one or more of the following:

1) On-demand flexibility: Proteus does not make any assumption on traffic patterns and is able to adaptively reconstruct network communication paths based on traffic demand. This makes the preferred embodiment highly appealing to future data centers where both the network and application may evolve over time.

2) High server-to-server throughput: Proteus significantly improves the communication bandwidth between any pair of servers. Once the optical circuit path is set up, a bit rate transparent communication pipe becomes available. With current technologies, per channel bit rate in optical fiber communications can be as high as 40 Gb/s or 100 Gb/s, and the total capacity per fiber with DWDM technologies can reach 69 Tb/s.

3) Efficient network resource utilization: Network paths are dynamically constructed based on traffic demand in such a way that overall network-wide traffic can be maximally served. This global optimization overcomes network resource fragmentation incurred by today's tree-based DCN architectures and other existing approaches where local optimization is adopted.

4) Cabling simplicity: One of the challenges faced by current data center networks is caused by the high complexity of a large number of connecting cables. With the adoption of optical fiber cabling, network upgrades and expansion can be achieved by adding additional wavelengths, instead of additional cables.

5) Lower power consumption: Optical components generally consume a fraction of the energy of their electrical counterparts, and since this preferred embodiment uses optical components extensively, the overall DCN power consumption should be lowered significantly.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an exemplary system with optical interconnects in a data center network.

FIG. 2 shows in more detail the optical component of FIG. 1.

FIG. 3 shows an exemplary control manager for the system of FIG. 1.

FIG. 4 shows an exemplary Greedy-Tree method to dynamically reconstruct routing paths according to changing network traffic demand.

FIG. 5 shows an exemplary Darwinian method to dynamically reconstruct routing paths according to changing network traffic demand.

FIG. 6 shows an exemplary fault-tolerant routing method.

FIG. 7 shows an exemplary wavelength assignment method.

DETAILED DESCRIPTION

FIG. 1 shows an exemplary system with optical interconnects in a data center network. An optical switch matrix (OSM) 102 allows a plurality of optical ports to communicate with each other through optical components 110. Each optical component 110 in turn communicates with a top of rack (ToR) switch. Each ToR switch in turn is connected to a plurality of servers and to other ToRs.

The system of FIG. 1 uses hop-by-hop routing, in which traffic that cannot be provisioned with a direct end-to-end circuit will be routed to the destination by traversing multiple hops (i.e., ToR switches). Each ToR switch not only receives traffic destined for servers located in its own rack, but also forwards transit traffic targeted at servers residing in other racks. This mechanism allows the system of FIG. 1 to achieve connectivity between any pair of origin and destination servers. This approach is in contrast to conventional optical communication systems, in which only single-hop routing is employed.

In one particular instantiation, each ToR switch is a conventional switch with 64 10-GigE ports. Of these 64 ports at each ToR, 32 are connected to servers via existing intra-ToR interconnects. Each of the remaining 32 ports is used to connect to the optical interconnect between ToRs. Each inter-ToR port is attached to transceivers associated with a fixed wavelength for sending and receiving data. Excluding the ToR switches, all the remaining interconnect elements are optical. These optical elements allow for reconfiguration, making the network highly adaptive to changes in the underlying traffic requirements.

The system of FIG. 1 uses all optical interconnects. In contrast to their electrical counterparts, optical network elements support on-demand provisioning of connectivity and capacity where required in the network, thus permitting the construction of thin, but malleable interconnects for large server pools. Optical links can support higher bit-rates over longer distances using less power than copper cables. Moreover, optical switches run cooler than electrical ones, implying lower heat dissipation and cheaper cooling cost.

FIG. 2 shows in more detail the optical component 110. To make full use of the MEMS ports, each circuit over the MEMS is bidirectional. For this, optical circulators 126 and 136 are placed between the ToR and MEMS ports. A circulator 126 connects the send channel of the transceiver from a ToR 120 to the MEMS port 102 (after the channel has passed through the WSS 124). It simultaneously delivers the traffic incoming towards a ToR from the MEMS to this ToR. Even though the MEMS edges are bidirectional, the capacities of the two directions are independent of each other. The inter-ToR ports attach themselves to two transceivers so that they can send and receive data simultaneously. As shown in the left half of FIG. 2, the optical fiber from the “send” transceivers from each of the 32 ports at a ToR 120 is connected to an optical multiplexer 122. Each port is associated with a wavelength, unique across ports at the ToR 120, in order to exploit wavelength division multiplexing (WDM). This allows data from different ports to be multiplexed into one fiber without contention. This fiber is then connected to a 1×4 Wavelength Selective Switch (WSS) 124. The WSS 124 is typically an optical component consisting of one common port and several wavelength ports. It partitions the set of wavelengths coming in through the common port among the wavelength ports, and the mapping is runtime-configurable (in a few milliseconds). The WSS 124 can split the set of 32 wavelengths it sees into four groups, each group being transmitted out on its own fiber. Each such fiber is connected to the MEMS optical switch 102 through a circulator 126 to enable bidirectional traffic through it. The circulators enable bidirectional optical transmission over a fiber, allowing more efficient use of the ports of optical switches. An optical circulator is a three-port device: one port is a shared fiber or switching port, and the other two ports serve as send and receive ports.

Optical transceivers can be of two types: coarse WDM (CWDM) and dense WDM (DWDM). One embodiment uses DWDM-based transceivers, which support higher bit-rates and more wavelength channels in a single piece of fiber compared to CWDM.

The receiving infrastructure (shown in the right half of FIG. 2) has a coupler 136 connected to a demultiplexer 132 which separates multiple incoming wavelengths, each then delivered to a different port. In one embodiment, four receive fibers, one from each of four circulators, are connected to a power coupler 134 which combines their wavelengths onto one optical fiber. This fiber feeds into a demultiplexer 132 which splits each incoming wavelength to its associated port for a ToR 130. In one embodiment, the interconnect of FIG. 1 uses a 320-port micro-electro-mechanical systems (MEMS) switch to connect 80 ToRs with a total of 2560 servers.

Depending on the channel spacing, using WDM, a number of channels or wavelengths can be transmitted over a single piece of fiber in the conventional or C-band. In one embodiment, each wavelength is rate-limited by the electrical port it is connected to. The OSM modules in optical communications can be bipartite switching matrices where any input port can be connected to any one of the output ports. A Micro-Electro-Mechanical Switch (MEMS) can be used as an OSM and achieves a reconfigurable one-to-one circuit between its input and output ports by mechanically adjusting micro mirrors.

The system of FIG. 2 offers highly flexible bandwidth. Every ToR has degree k. If each edge had fixed bandwidth, multiple edges would need to be utilized for this ToR to communicate with another ToR at a rate higher than a single edge supports. To overcome this problem, the system combines the capability of optical fibers to carry multiple wavelengths at the same time (WDM) with the dynamic reconfigurability of the WSS. Consequently, a ToR is connected to the MEMS through a multiplexer and a WSS unit.

Specifically, suppose ToR A wants to communicate with ToR B using w times the line speed of a single port. The ToR will use w ports, each associated with a (unique) wavelength, to serve this request. WDM enables these w wavelengths, together with the rest from this ToR, to be multiplexed into one optical fiber that feeds the WSS. The WSS splits these w wavelengths to the appropriate MEMS port which has a circuit to ToR B (doing likewise for k−1 other sets of wavelengths). Thus, a w×(line-speed) capacity circuit is set up from A to B at runtime. By varying the value of w for every MEMS circuit connection, the system offers dynamic capacity for every edge.

In one embodiment, each ToR can communicate simultaneously with any four other ToRs. Thus, the MEMS switch 102 can construct all possible 4-regular ToR interconnection graphs. Secondly, through WSS configuration, each of these four links' capacity can be varied in {0, 10, 20, . . . , 320} Gbps, provided the sum does not exceed 320 Gbps. Thus, both the path between servers as well as the capacity of these paths can be varied in this architecture.

To enable a ToR pair to communicate using all available wavelengths, each ToR port (facing the optical interconnect) is assigned a wavelength unique across ports at the ToR. The same wavelength is used to receive traffic as well: each port thus sends and receives traffic at one fixed wavelength. The same set of wavelengths is recycled across ToRs. This allows all wavelengths at one ToR to be multiplexed and delivered after demultiplexing to individual ports at the destination ToR. This wavelength-port association is a static, design/build time decision.
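The capacity rule above (four links, each a multiple of 10 Gbps, summing to at most 320 Gbps) can be illustrated with a minimal sketch. The function name `partition_wavelengths` and the contiguous first-fit allocation are assumptions made for illustration only; an actual WSS configuration must also respect the wavelength constraints at the receiving ToRs.

```python
from typing import Dict, List

LINE_RATE_GBPS = 10   # per-wavelength rate, taken from the 10-GigE ports in the text
NUM_WAVELENGTHS = 32  # wavelengths per ToR in the example instantiation


def partition_wavelengths(demand_gbps: List[int]) -> Dict[int, List[int]]:
    """Map each WSS output port to a block of wavelength indices (hypothetical helper).

    demand_gbps[p] is the capacity requested on the link behind WSS port p.
    Each demand is rounded up to a whole number of wavelengths.
    """
    needed = [-(-d // LINE_RATE_GBPS) for d in demand_gbps]  # ceiling division
    if sum(needed) > NUM_WAVELENGTHS:
        raise ValueError("total demand exceeds the 320 Gbps available at this ToR")
    assignment: Dict[int, List[int]] = {}
    next_free = 0
    for port, count in enumerate(needed):
        # Assumption: wavelengths are handed out as contiguous blocks.
        assignment[port] = list(range(next_free, next_free + count))
        next_free += count
    return assignment
```

For example, demands of 100, 50, 0 and 170 Gbps on the four links use 10, 5, 0 and 17 wavelengths respectively, which together consume all 32 wavelengths of the ToR.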

One exemplary specific instantiation of FIG. 1 deploys N=80 ToRs, W=32 wavelengths and k=4 ToR-degree using a 320-port MEMS to support 2560 servers. Each ToR is a conventional electrical switch with 64 10-GigE non-blocking ports. 32 of these ports are connected to servers, while the remaining 32 face the optical interconnect. Each port facing the optical interconnect has a transceiver associated with a fixed and unique wavelength for sending and receiving data. The transceiver uses separate fibers to connect to the send and receive infrastructures. The send fiber from the transceivers from each of the 32 ports at a ToR is connected to an optical multiplexer. The multiplexer feeds a 1×4 WSS. The WSS splits the set of 32 wavelengths it sees into 4 groups, each group being transmitted on its own fiber. These fibers are connected to the MEMS switch through circulators to enable bidirectional traffic through them. The 4 receive fibers from each of 4 circulators corresponding to a ToR are connected to a power coupler (similar to a multiplexer, but simpler), which combines their wavelengths onto one fiber. This fiber feeds a demultiplexer, which splits each incoming wavelength to its associated port on the ToR.
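The sizing figures quoted for this instantiation follow from simple arithmetic, sketched below. The function `size_interconnect` and its parameter names are hypothetical and serve only to make the relationships explicit.

```python
def size_interconnect(num_tors: int, degree: int, server_ports: int,
                      wavelengths: int, line_rate_gbps: int) -> dict:
    """Derive the headline sizes of the interconnect (illustrative helper)."""
    return {
        # Each ToR occupies one MEMS port per WSS output, i.e. k ports per ToR.
        "mems_ports": num_tors * degree,
        # Each server-facing ToR port hosts one server.
        "servers": num_tors * server_ports,
        # One wavelength per optical-facing port, each at line rate.
        "per_tor_optical_gbps": wavelengths * line_rate_gbps,
    }
```

With N=80, k=4, 32 server ports, W=32 and 10 Gbps per port, this yields 320 MEMS ports, 2560 servers and 320 Gbps of optical capacity per ToR, matching the numbers in the text.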

In this interconnect, each ToR can communicate simultaneously with any 4 other ToRs. This implies that MEMS reconfigurations allow us to construct all possible 4-regular ToR graphs. Second, through WSS configuration, each of these 4 links' capacity can be varied in {0, 10, 20, . . . , 320} Gbps. As discussed in more detail below, these configurations are decided by a centralized manager. The manager obtains the traffic matrix from the ToR switches, calculates appropriate configurations, and pushes them to the MEMS, WSS, and ToRs. This requires direct, out-of-band connections between the manager and these units.

The implementation is highly flexible: given a number N of Top-of-Rack (ToR) switches and a design-time-fixed parameter k, the system can assume any k-regular topology over the N ToRs. To illustrate how many options this gives, consider that for just N=20, there are over 12 billion (non-isomorphic) connected 4-regular graphs. In addition, the system allows the capacity of each edge in this k-regular topology to be varied from a few Gb/s to a few hundred Gb/s. Simulations show that the system can always deliver full bisection bandwidth for low-degree (e.g., inter-ToR ≤ 4) traffic patterns, and even over 60% of the throughput of a non-blocking network in case of moderately high-degree (e.g., inter-ToR ∈ [4,20]) traffic patterns. Furthermore, it enables lower (50%) power consumption and lower (20%) cabling complexity compared to a fat-tree connecting a similar number of servers. While at current retail prices the system is marginally more costly (10%) than a fat-tree (at 10 GigE per-port), a cost advantage should materialize as optical equipment sees commoditization and higher bit-rates gain traction.

With a larger number of MEMS and WSS ports, topologies with higher degrees and/or larger numbers of ToRs can be built. It is also possible to make heterogeneous interconnects in which a few nodes have larger degree than the rest.

The system of FIGS. 1-2 achieves topology flexibility by exploiting the reconfigurability of the MEMS. Given a ToR-graph connected by optical circuits through the MEMS, the system uses hop-by-hop stitching of such circuits to achieve network connectivity. To reach ToRs not directly connected to it through the MEMS, a ToR uses one of its connections. This first-hop ToR receives the transmission over fiber, converts it to electrical signals, reads the packet header, and routes it towards the destination. At each hop, every packet experiences conversion from optics to electronics and then back to optics (O-E-O). Such conversion can be done at the sub-nanosecond level. At any port, the aggregate transit, incoming and outgoing traffic cannot exceed the port's capacity in each direction. So, high-volume connections must use a minimal number of hops. The system manages the topology to adhere to this requirement.

To support adapting to a wider variety of traffic patterns, the flexible DCN architecture of FIG. 1 also needs a topology manager that (a) configures the MEMS to adjust the topology to localize high traffic volumes, (b) configures the WSS at each ToR to adjust the capacity of its four outgoing links to provision bandwidth where it is most gainful, and (c) picks routes between ToR-pairs to achieve high throughput, low latency and minimal network congestion.

The control software run by the topology manager solves this problem of topology management, which can be formulated as a mixed-integer linear program. In the following discussion, D denotes the traffic demand between ToRs, where D_(ij) is the desired bandwidth from ToR_(i) to ToR_(j).

Variables: Four classes of variables: l_(ij)=1 if ToR_(i) is connected to ToR_(j) through MEMS and 0 otherwise; w_(ijk)=1 if l_(ij) carries wavelength λ_(k) in the i→j direction and 0 otherwise; a traffic-served matrix S, where S_(ij) is the bandwidth provisioned (possibly over multiple paths) from ToR_(i) to ToR_(j); v_(ijk) is the volume of traffic carried by wavelength λ_(k) along i→j. Among the latter two sets of variables, S_(ij) have end-to-end meaning, while v_(ijk) have hop-to-hop significance. For all variables, k∈{1, 2, . . . , λ_(Total)}; i,j∈{1, 2, . . . , # ToRs}, i≠j; l_(ij) are the only variables for which l_(ij)=l_(ji) always holds; all other variables are directional.

Objective: A simplistic objective is to maximize the traffic served (constrained by demand, see (6)):

$\text{Maximize} \sum_{i,j} S_{ij}. \qquad (1)$

Constraints:

A wavelength λ_(k) can only be used between two ToRs if they are connected through MEMS:

$\forall i,j,k:\; w_{ijk} \le l_{ij}. \qquad (2)$

ToR_(i) can receive/send λ_(k) from/to at most one ToR (this is illustrated in FIG. 3):

$\forall i,k:\; \sum_{j} w_{jik} \le 1;\quad \sum_{j} w_{ijk} \le 1. \qquad (3)$

If the number of ports of the WSS units is W, then ToR_(i) is connected to exactly W other ToRs:

$\forall i:\; \sum_{j} l_{ij} = W. \qquad (4)$

Hop-by-hop traffic is limited by port capacities (C_(port)), wavelength capacity (C_(λ)), and provisioning:

$\forall i,j,k:\; v_{ijk} \le \min\{C_{port},\, C_{\lambda} \times w_{ijk}\}. \qquad (5)$

A constraint is to never provision more traffic than demanded:

$\forall i,j:\; S_{ij} \le D_{ij}. \qquad (6)$

The outgoing transit traffic (total traffic flowing out, minus total traffic for which ToR_(i) is the origin) equals incoming transit traffic at ToR_(i):

$\forall i:\; \sum_{j,k} v_{ijk} - \sum_{j} S_{ij} = \sum_{j,k} v_{jik} - \sum_{j} S_{ji}. \qquad (7)$

The above mixed-integer linear program (MILP) can be seen as a maximum multi-commodity flow problem with degree bounds, further generalized to allow constrained choices in edge capacities. While several variants of the degree-bounded subgraph and maximum flow problems have known polynomial time algorithms, even simple combinations of the two are known to be NP-hard. Thus, to simplify the computation, heuristic approaches are presented by which the control software finds an optimized topology and link capacity assignment to meet the changing traffic patterns. The control software tightly interacts with the OSM/MEMS, WSS and ToR switches to control the network topology, link capacity and routing.
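To make constraints (2) through (7) concrete, the following is a minimal sketch of a feasibility checker for a candidate assignment of the variables. It does not solve the MILP. The function name `check_solution`, the nested-list data layout and the parameter names are assumptions introduced for illustration.

```python
from typing import List


def check_solution(n: int, n_lambda: int, wss_ports: int,
                   c_port: float, c_lambda: float,
                   D: List[List[float]], l: List[List[int]],
                   w: List[List[List[int]]], v: List[List[List[float]]],
                   S: List[List[float]], tol: float = 1e-9) -> List[int]:
    """Return the numbers of the violated constraints among (2)-(7)."""
    bad = set()
    nodes, ks = range(n), range(n_lambda)
    for i in nodes:
        others = [j for j in nodes if j != i]
        for j in others:
            if S[i][j] > D[i][j] + tol:                     # (6) never exceed demand
                bad.add(6)
            for k in ks:
                if w[i][j][k] > l[i][j]:                    # (2) wavelength needs a circuit
                    bad.add(2)
                if v[i][j][k] > min(c_port, c_lambda * w[i][j][k]) + tol:
                    bad.add(5)                              # (5) capacity and provisioning
        for k in ks:
            recv = sum(w[j][i][k] for j in others)
            send = sum(w[i][j][k] for j in others)
            if recv > 1 or send > 1:                        # (3) one peer per wavelength
                bad.add(3)
        if sum(l[i][j] for j in others) != wss_ports:       # (4) degree equals W
            bad.add(4)
        out_transit = (sum(v[i][j][k] for j in others for k in ks)
                       - sum(S[i][j] for j in others))
        in_transit = (sum(v[j][i][k] for j in others for k in ks)
                      - sum(S[j][i] for j in others))
        if abs(out_transit - in_transit) > tol:             # (7) transit conservation
            bad.add(7)
    return sorted(bad)
```

An empty list means the candidate satisfies all six constraints; the objective (1) is then simply the sum of the entries of S.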

FIG. 3 shows an exemplary control manager 200 that controls the system 100 of FIG. 1. The control system includes a module 202 that estimates traffic demand. The module 202 provides input to a module 204 that assigns pairs with heavy communications to direct links. Next a module 206 establishes the connectivity accordingly. Through modules 204-206, the manager 200 controls the MEMS optical switch 102 to adjust the network topology. Next, a module 210 identifies routing paths and sends all the ToRs these paths in order to set up their routing tables. A module 214 then determines the capacity demand on each link and a module 216 then determines the wavelength assignment scheme.

In one embodiment, as conventionally done, the software estimates the traffic demand according to max-min fair bandwidth allocation for TCP flows in an ideal non-blocking network. All the flows are only limited by the sender or receiver network interface cards (NICs).

The manager assigns direct links for heavy communicating pairs. High-volume communicating pairs (i.e., ToR switches) are placed over direct MEMS circuit links. This is accomplished by using a weighted b-matching, where b represents the number of connections that each ToR has to the MEMS (b=4 in our example scenario). It is easy to cast the problem of localizing high-volume ToR-connections to b-matching: in the ToR graph, assign the edge-weight between two ToRs as the estimated flow-size between them. Weighted b-matching is a graph theoretic problem for which an elegant polynomial-time algorithm is known. In one embodiment, the weighted b-matching algorithm is approximated using multiple 1-matchings.
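The b-matching step can be illustrated with a short sketch. Note that this is a plain greedy heuristic (heaviest edge first, subject to the degree bound), not the exact polynomial-time algorithm or the multiple 1-matching approximation mentioned above; the function name and data layout are assumptions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def greedy_b_matching(weights: Dict[Edge, float], b: int) -> List[Edge]:
    """Pick ToR pairs for direct circuits, heaviest estimated flow first.

    weights maps an undirected ToR pair (u, v) to its estimated flow size.
    No ToR ends up with more than b direct circuits.
    """
    degree: Dict[int, int] = {}
    chosen: List[Edge] = []
    for (u, v), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        if degree.get(u, 0) < b and degree.get(v, 0) < b:
            chosen.append((u, v))
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
    return chosen
```

With b=4 as in the example scenario, each ToR receives at most four direct circuits. A greedy choice can be suboptimal in total weight, which is why an exact matching algorithm may be preferred when computation time allows.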

Connectivity is achieved through the edge-exchange operation as follows. First, the method locates all connected components. If the graph is not connected, the method selects two edges a→b and c→d with lowest weights in different connected components, and replaces links a→b and c→d with links a→c and b→d to connect them. A check is done to make sure that the links removed are not themselves cuts in the graph. The output of the direct-link assignment and connectivity steps is used to tell the MEMS optical switch 102 how to configure the network topology.

Once connectivity is determined, the MEMS optical switch configuration is known. The method finds routes using any of the standard routing schemes such as the shortest path or a low congestion routing scheme. Some of the routes are single-hop MEMS connections while others are multi-hop MEMS connections. In one implementation, the standard shortest path technique is used to calculate the routing paths. However, the framework can be readily applied to any other routing scheme. The output is used to tell the ToRs how to configure their routing tables.
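As an illustration of the shortest-path option, the sketch below derives next-hop tables by breadth-first search over the ToR graph. Using hop count as the path metric and representing routes as next-hop tables are assumptions; the function name is hypothetical.

```python
from collections import deque
from typing import Dict, List


def next_hop_tables(adj: Dict[int, List[int]]) -> Dict[int, Dict[int, int]]:
    """For every source ToR, map each reachable destination to the first hop.

    adj maps a ToR to the ToRs it reaches over a direct MEMS circuit.
    """
    tables: Dict[int, Dict[int, int]] = {}
    for src in adj:
        first_hop: Dict[int, int] = {}
        queue = deque()
        for nb in adj[src]:
            if nb not in first_hop:
                first_hop[nb] = nb        # direct (single-hop) circuit
                queue.append(nb)
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb != src and nb not in first_hop:
                    first_hop[nb] = first_hop[node]  # inherit the first hop
                    queue.append(nb)
        tables[src] = first_hop
    return tables
```

For a chain of four ToRs 0-1-2-3, ToR 0 forwards everything via ToR 1, while ToR 1 reaches ToR 3 via ToR 2.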

Given the routing and the estimated traffic demand (aggregated) between each pair of ToRs, the method computes the link capacity desired on each link. To satisfy the capacity demand on each link, multiple wavelengths may be used. However, the sum of capacity demands of all links associated with a ToR switch must not exceed the capacity of this ToR.
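The link capacity computation can be sketched as follows: the demand of every ToR pair is added to each directed link on its route, and the load is rounded up to whole wavelengths. The function name, the path representation and the 10 Gbps default are assumptions; the per-ToR capacity check mentioned above is left to the caller.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def link_wavelength_demand(paths: Dict[Pair, List[int]],
                           demand_gbps: Dict[Pair, float],
                           line_rate_gbps: float = 10.0) -> Dict[Pair, int]:
    """Number of wavelengths desired on each directed link.

    paths maps (src, dst) to the ToR sequence of its route; demand_gbps maps
    (src, dst) to the estimated aggregate demand.
    """
    load: Dict[Pair, float] = defaultdict(float)
    for (src, dst), path in paths.items():
        for u, v in zip(path, path[1:]):
            load[(u, v)] += demand_gbps.get((src, dst), 0.0)
    return {link: math.ceil(gbps / line_rate_gbps) for link, gbps in load.items()}
```

For example, if 0→2 is routed over 0, 1, 2 with 15 Gbps and 0→1 carries 10 Gbps directly, link 0→1 sees 25 Gbps and needs three wavelengths, while link 1→2 needs two.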

After figuring out the desired capacity on each link, the system needs to provision wavelengths appropriately to serve these demands. This problem is reduced to an edge-coloring problem on a multigraph. Multiple edges correspond to the volume of traffic between two nodes, and wavelengths are the colors to be used to color these edges. For instance, D→A and B→A cannot both use the same wavelength. This constraint stems from the fact that two data-flows encoded over the same wavelength cannot share the same optical fiber in the same direction. Various fast edge-coloring heuristics can be used, and an algorithm based on Vizing's theorem is used in one embodiment due to speed and code availability.

One implementation requires at least one wavelength to be assigned to each edge on the physical topology. This guarantees an available path between any ToR-pair, which may be required for mice/bursty flows. The output is used to tell the WSS how to assign wavelengths.
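The wavelength assignment constraint can be illustrated with a first-fit greedy coloring. This is a deliberately simple stand-in for the Vizing-based algorithm mentioned above and may fail on inputs for which a valid assignment exists; the function name and data layout are assumptions.

```python
from typing import Dict, Hashable, List, Set, Tuple

Link = Tuple[Hashable, Hashable]


def assign_wavelengths(need: Dict[Link, int],
                       num_wavelengths: int = 32) -> Dict[Link, List[int]]:
    """Give each directed link (u, v) the requested number of wavelengths.

    A wavelength is used at most once for sending at u and at most once for
    receiving at v, mirroring constraint (3).
    """
    sent: Dict[Hashable, Set[int]] = {}
    recv: Dict[Hashable, Set[int]] = {}
    out: Dict[Link, List[int]] = {}
    # Serve the most demanding links first (a common greedy ordering).
    for (u, v), count in sorted(need.items(), key=lambda kv: -kv[1]):
        busy = sent.setdefault(u, set()) | recv.setdefault(v, set())
        free = [k for k in range(num_wavelengths) if k not in busy]
        if len(free) < count:
            raise ValueError(f"not enough free wavelengths for link {u}->{v}")
        out[(u, v)] = free[:count]
        sent[u].update(free[:count])
        recv[v].update(free[:count])
    return out
```

In the example from the text, links D→A and B→A each needing one wavelength receive different wavelengths, because both terminate at ToR A.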

During operation, the system works based on the value of η. η is defined as the expected throughput achieved via link capacity adjustment versus that achieved via network topology change. If the throughput obtained by only adjusting link capacity is significant enough compared to that obtained by rearranging the topology, the system can adjust link capacity while keeping the current topology. This is cheaper than changing the topology, since topology changes necessitate changes in the routing tables of the ToRs. It is possible that the traffic pattern has fundamentally changed so that only adjusting the link capacity cannot provide a satisfactory throughput. In this case, the system reconfigures the network topology. In practice, the system can modify η on-demand to satisfy different performance requirements.
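One possible reading of this rule is sketched below: η is computed as a ratio of the two expected throughputs and compared against a tunable threshold. Treating η as a ratio with a separate threshold, as well as the function and parameter names, are assumptions made for illustration.

```python
def choose_action(throughput_capacity_only: float,
                  throughput_topology_change: float,
                  eta_threshold: float) -> str:
    """Decide between the cheap and the expensive reconfiguration.

    Returns "adjust_capacity" when capacity adjustment alone achieves at least
    eta_threshold of what a topology change would achieve.
    """
    if throughput_topology_change <= 0:
        return "adjust_capacity"  # a topology change offers nothing to gain
    eta = throughput_capacity_only / throughput_topology_change
    return "adjust_capacity" if eta >= eta_threshold else "reconfigure_topology"
```

Raising the threshold makes the system reconfigure the topology more readily, at the cost of more frequent routing table updates at the ToRs.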

Due to the easy availability of network state (e.g., topology, traffic demand, etc.) at the manager, routing can be easily realized in a centralized manner, where the manager is responsible for calculating and updating the routing table for each ToR. For simplicity, the manager employs shortest path routing with failover paths. However, any other sophisticated routing algorithm can be readily applied. The flexibility of the architecture of FIG. 1 can be used not only to meet the changing traffic patterns, but also to handle failures (e.g., a WSS port failure can be taken care of by dynamically assigning that port's wavelengths to the remaining ports). In addition, the system graphs are inherently fault-tolerant due to their path redundancy, and we demonstrate, via simulations, appealing performance in the presence of a large percentage of link and/or node failures.

FIG. 4 shows another exemplary method, GreedyTree, to dynamically adjust the topology according to changing network traffic demand, different from the above method. This mechanism is a tree-inspired design and attempts to form a tree in such a way that traffic is concentrated towards the leaves, so that voluminous flows do not occupy a large number of hops. In this method, the input is a traffic matrix D (traffic demand between any pair of racks) where D_(ij) denotes traffic travelling from ToR i to ToR j. D is asymmetric due to the directional nature of network traffic. First, the method initializes a virtual node set V (302). Next, the method checks if V has only one element (304) and if so, exits processing. Otherwise, the method determines a traffic matrix M over the set V (306), and then applies maximum weighted bipartite matching to determine which pairs of nodes should be connected to form a higher level virtual node (308). Next, for each pair of nodes to connect, standard matching is used to determine the real underlying nodes to connect (310). If there are not enough wavelengths to connect the nodes, the method reassigns least used wavelengths from the lower levels while maintaining connectivity (310). The method loops back to 304 until all elements are processed.

In one embodiment, for each iteration, the method attempts to connect pairs of virtual nodes that yield the maximum benefit by finding a matching. The initial set of virtual nodes is the same as the set of ToRs. At every stage, pairs of virtual nodes from the previous stage are connected. The total bandwidth demand across two virtual-nodes is first computed by summing demands from the real nodes in each virtual-node to the other. These pair-wise demands are used as weights for a standard matching algorithm (such as Edmonds' algorithm, among others) to obtain the best set of virtual-edges. Each virtual edge can have one or more real edges and a number of wavelengths. These edges and wavelengths are determined by a heuristic-based function which uses matching restricted to only the sets of nodes in the two virtual-nodes being connected. If more wavelengths and links are required than are available from the two virtual-nodes, then links and wavelengths from the lower level are harvested (least useful at the lower level first) while preserving connectivity. The algorithm iterates until it has built one large virtual node. Once the method terminates, all configurations are pushed to the optical elements.
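The virtual-node merging loop can be sketched as below. The demand aggregation follows the description above, whereas the pairing step uses a simple greedy choice in place of a standard matching algorithm such as Edmonds'; the selection of real edges and the wavelength harvesting are omitted. All names are hypothetical.

```python
from itertools import combinations
from typing import List


def virtual_demand(D: List[List[float]], a: List[int], b: List[int]) -> float:
    """Total demand between two virtual nodes, summed in both directions."""
    return sum(D[i][j] + D[j][i] for i in a for j in b)


def merge_level(D: List[List[float]], groups: List[List[int]]) -> List[List[int]]:
    """Pair up virtual nodes, heaviest mutual demand first (greedy stand-in)."""
    pairs = sorted(combinations(range(len(groups)), 2),
                   key=lambda p: -virtual_demand(D, groups[p[0]], groups[p[1]]))
    used, merged = set(), []
    for a, b in pairs:
        if a not in used and b not in used:
            used.update((a, b))
            merged.append(groups[a] + groups[b])
    # An unpaired virtual node (odd count) is carried to the next level as is.
    merged.extend(g for idx, g in enumerate(groups) if idx not in used)
    return merged


def greedy_tree_levels(D: List[List[float]]) -> List[List[List[int]]]:
    """Return the virtual node sets of every level, from ToRs up to the root."""
    groups = [[i] for i in range(len(D))]
    levels = [groups]
    while len(groups) > 1:
        groups = merge_level(D, groups)
        levels.append(groups)
    return levels
```

Starting from one virtual node per ToR, each level roughly halves the number of virtual nodes until a single virtual node remains.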

Another heuristic alternative to FIG. 4 is discussed next. FIG. 5 shows an exemplary Darwinian method to dynamically reconstruct routing paths according to changing network traffic demand. First, the method initializes a virtual node set V (330). Next, the method determines a traffic matrix M over the set V (332), and then applies a 4-matching technique to determine which pairs of nodes should be connected to form a higher level virtual node (334). Next, the method establishes graph connectivity using edge-exchange operations (336).

The Darwinian heuristic attempts to localize high-volume flows over direct circuit links. This is accomplished by using a weighted matching restricted to a degree of 4 (i.e., weighted 4-matching), where 4 represents the number of connections each ToR has to the MEMS. However, this matching does not by itself guarantee connectivity. Connectivity is ensured using the edge-exchange operation on the edges of lowest weight across pairs of components, thus connecting them. This edge-exchange operation is repeated until connectivity is achieved between all source-destination pairs.
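A single edge-exchange step might look like the following Python sketch. The swap rule used here is one possible reading of the text, not a rule stated in it: the lowest-weight edge (a, b) of one component and the lowest-weight edge (c, d) of another are replaced by the cross edges (a, c) and (b, d). This keeps every node's degree unchanged and joins the two components whenever at least one of the removed edges lies on a cycle. The case where both removed edges are bridges is not handled. The names `components` and `edge_exchange` are hypothetical.

```python
def components(nodes, edges):
    """Connected components of an undirected graph, found with union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def edge_exchange(edges, weight, comp_a, comp_b):
    """Swap the lightest edge of each component for two cross edges."""

    def lightest(comp):
        inside = [e for e in edges if e[0] in comp and e[1] in comp]
        return min(inside, key=lambda e: weight[e])

    (a, b), (c, d) = lightest(comp_a), lightest(comp_b)
    kept = [e for e in edges if e not in ((a, b), (c, d))]
    # Degree-preserving: a, b, c, d each lose one edge and gain one edge.
    return kept + [(a, c), (b, d)]


# Two disjoint triangles; edge weights are illustrative demand values.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
weight = {(0, 1): 5, (1, 2): 1, (0, 2): 4, (3, 4): 2, (4, 5): 6, (3, 5): 3}
comp_a, comp_b = components(nodes, edges)
connected_edges = edge_exchange(edges, weight, comp_a, comp_b)
```

In a full heuristic this step would be repeated over pairs of components until a single component remains.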

The Darwinian heuristic is based on the idea of starting out with a structured topology (such as a k-regular circulant graph, a Kautz digraph, an incomplete hypercube, or even a DCell-like topology) from which the topology keeps evolving. Over this topology, it is possible to use degree-preserving operations to better conform to the traffic matrix. For example, if two ToRs that seek to establish a high bandwidth connection are each connected to two other ToRs and are not serving much transit traffic, they can be connected directly by breaking one of their current links. The advantage of this method is that it is iterative and each iteration should be computationally inexpensive. It is also likely that a large number of large flows do not change simultaneously, so a large number of such operations should rarely be required. It is possible to use this method as a continuous background optimization. The objective is to ensure that a weighted sum of path lengths is minimized.
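The stated objective can be made concrete with a short Python sketch. It assumes that path length is measured in hops and that each flow is weighted by its demand D[i][j]. Both are interpretations, since the text does not fix the weighting. The function names are hypothetical, and the graph is assumed to be connected.

```python
from collections import deque


def hop_counts(adj, src):
    """Breadth-first search: hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def weighted_path_length(edges, D):
    """Sum over ToR pairs of demand times hop count (assumes a connected graph)."""
    n = len(D)
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0
    for i in range(n):
        dist = hop_counts(adj, i)
        for j in range(n):
            if i != j and D[i][j]:
                total += D[i][j] * dist[j]
    return total


# A 4-node ring in which ToR 0 sends heavily to ToR 2, two hops away.
D = [[0, 0, 10, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Degree-preserving exchange: drop (0, 1) and (2, 3), add (0, 2) and (1, 3).
evolved = [(1, 2), (3, 0), (0, 2), (1, 3)]
before = weighted_path_length(ring, D)
after = weighted_path_length(evolved, D)
```

Here the exchange halves the objective, because the heavy flow from ToR 0 to ToR 2 now crosses a single direct link while every node keeps degree 2.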

The GreedyTree and Darwinian heuristics or processes reconstruct the network topology in adaptation to changing traffic demand and can deal with arbitrary traffic patterns. This is in contrast to conventional systems where a particular traffic pattern is assumed. The GreedyTree method intelligently utilizes the switching and reconfiguration functionalities of WSS and adaptively redistributes wavelength assignment to cope with topology and routing changes. This is also the first application of WSS in data center networks.

Once connectivity is achieved, the MEMS configuration is known. The system then finds routes using any standard routing scheme, such as shortest path or, preferably, a low congestion routing scheme. In one embodiment shown in FIG. 6, a simple, yet effective, shortest path routing scheme called Fault-tolerant Proteus Routing (FPR) is used.

In FIG. 6, the input is the topology represented by a graph G(V, E), the edge weights w, the source node s, and the destination node d. During initialization, the weight of each edge is set to one (350). Next, the method determines the primary path between s and d: P_(Primary)=shortest_path(G, s, d, w) (352). The method then determines the failover path between s and d (354). In one embodiment, this is done by setting w(e)=w(e)+|E| for each edge e on the primary path P_(Primary), and then computing P_(Failover)=shortest_path(G, s, d, w). Finally, the method returns P_(Primary) and P_(Failover) as the result (356).
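The steps of FIG. 6 can be sketched in Python as follows. This is an illustrative sketch rather than the disclosed implementation: the graph is assumed to be undirected, Dijkstra's algorithm is assumed for shortest_path, and the function names `shortest_path` and `fpr` as written here are hypothetical.

```python
import heapq


def shortest_path(adj, w, s, d):
    """Dijkstra's algorithm; returns the node list from s to d, or None."""
    dist = {s: 0}
    prev = {}
    heap = [(0, s)]
    while heap:
        du, u = heapq.heappop(heap)
        if u == d:
            break
        if du > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in adj[u]:
            nd = du + w[frozenset((u, v))]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if d not in dist:
        return None
    path = [d]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return path[::-1]


def fpr(edges, s, d):
    """Return (primary, failover) paths between s and d."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    w = {frozenset(e): 1 for e in edges}       # step 350: unit weights
    primary = shortest_path(adj, w, s, d)      # step 352
    if primary is None:
        return None, None
    for u, v in zip(primary, primary[1:]):     # step 354: penalize by |E|
        w[frozenset((u, v))] += len(edges)
    failover = shortest_path(adj, w, s, d)
    return primary, failover                   # step 356


edges = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "E"), ("E", "D")]
primary, failover = fpr(edges, "A", "D")
# primary  -> ['A', 'B', 'D']
# failover -> ['A', 'C', 'E', 'D']
```

Adding |E| to each primary-path edge makes any path that avoids those edges cheaper than one that reuses them, since a simple path has fewer than |E| + 1 unit-weight edges. If no edge-disjoint alternative exists, the failover path still reuses as few primary edges as possible.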

The basic idea of FPR is simple. Using network status information, the Manager is responsible for calculating the routing table for each ToR switch. In one embodiment, for simplicity, the shortest path routing method of FIG. 6 is used for routing table construction. However, the scheme is readily applied to any other, more sophisticated routing calculation. Once link or node failures happen, the affected devices report to the Manager, and the Manager reacts by invoking the control software to rearrange the link capacity or topology (based on the degree of failures) to bypass the failed parts. In this sense, FPR is a simple and flexible way to handle failures, largely due to the architecture of FIG. 1.

FIG. 7 shows an exemplary wavelength assignment method. Turning now to FIG. 7, the input is a system graph and the capacity demand on each link. For each link, the method determines the number n of wavelengths needed to satisfy the capacity demand and replaces the link with n parallel directed links (380). Next, the method converts the resulting directed graph to an undirected graph by merging anti-parallel links (382). The method then applies a standard edge-coloring heuristic on this graph, where wavelengths are the colors used to color the edges (384). If the resulting coloring uses one extra color, then the method removes the color (i.e., wavelength) that is least used (386).

Using the method of FIG. 7, the system provisions or allocates wavelengths to serve capacity requirements. In one example, the system first decides the necessary number (say n) of wavelengths allocated to each optical fiber to meet the capacity requirements and replaces this link with n parallel directed links in the graph. For instance, if each wavelength carries at most 10 Gb/s and the capacity requirement of a particular link is 45 Gb/s, then the system replaces this link with 5 parallel links in the graph. After this operation, we obtain a graph in which each node has a degree of 32. In the second step, the system converts the resulting directed graph to an undirected graph by merging anti-parallel links, i.e., merging the directed link from node u to v and the one from v to u. The system thus obtains a new undirected graph with node degree 32. Then, the system applies a standard edge-coloring heuristic on this graph, where wavelengths are the colors used to color the edges. Since the heuristic may end up coloring the graph with one extra color (i.e., 33), the final step is to remove the color (i.e., wavelength) that is least used.
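The first three steps can be sketched in Python as shown below. Several details are assumptions rather than statements from the text: anti-parallel links are merged by taking the larger of the two wavelength counts, a simple first-fit greedy rule stands in for the standard edge-coloring heuristic, and the final removal of the least used color is omitted. The function names are hypothetical.

```python
import math


def wavelengths_needed(demand_gbps, per_wavelength_gbps=10):
    """Number of wavelengths required to carry the given demand."""
    return math.ceil(demand_gbps / per_wavelength_gbps)


def assign_wavelengths(demands, per_wavelength_gbps=10):
    """Map each undirected link to a list of wavelength indices (colors).

    demands: {(u, v): Gb/s} for directed links.
    """
    # Steps 380 and 382: expand each link into n parallel links, then merge
    # anti-parallel links (assumption: keep the larger of the two counts).
    merged = {}
    for (u, v), gbps in demands.items():
        key = tuple(sorted((u, v)))
        n = wavelengths_needed(gbps, per_wavelength_gbps)
        merged[key] = max(merged.get(key, 0), n)

    # Step 384: first-fit greedy edge coloring; no two links that share a
    # node may use the same wavelength.
    used = {}      # node -> wavelengths already in use at that node
    coloring = {}
    for (u, v), n in sorted(merged.items()):
        taken = used.setdefault(u, set()) | used.setdefault(v, set())
        colors = []
        c = 0
        while len(colors) < n:
            if c not in taken:
                colors.append(c)
            c += 1
        used[u].update(colors)
        used[v].update(colors)
        coloring[(u, v)] = colors
    return coloring


demands = {("A", "B"): 45, ("B", "A"): 20, ("B", "C"): 10, ("A", "C"): 25}
coloring = assign_wavelengths(demands)
# wavelengths_needed(45) -> 5, matching the 45 Gb/s example above
```

First-fit coloring is not optimal in general, so it may use more colors than the node degree. That is the situation the final color-removal step of FIG. 7 addresses.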

Next, a hop-by-hop routing method is discussed. This method automatically generates hop-by-hop routing protocols based on network topology changes. This is also a breakthrough in optical communications, especially in the context of data center networks, where only point-to-point optical communication is considered.

As the system does not impose the requirement of underlying all-to-all electrical connectivity between the servers, and due to the physical limitation on the number of possible optical paths between servers, the inclusion of hop-by-hop routing is necessary in the design. If a direct optical path does not exist, a hop-by-hop path can be used instead. For this purpose, a multi-hop routing protocol is used. Once a suitable configuration and paths have been computed, these are pushed to all ToRs. ToRs thus know their routes to all other ToRs and use source routing. Each packet from a server destined to some other server outside the ToR is tunneled through this source-routing protocol between ToRs. At the source ToR, a sequence of destination ToRs is specified in the header, and the packet is sent to the first ToR through the local forwarding table. The first hop then looks at the next hop in the sequence and sends the packet to it, and this is repeated until the data reaches the destination.
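The source-routed forwarding described above can be simulated with a few lines of Python. The `Packet` structure, its field names, and the `encapsulate` and `deliver` helpers are hypothetical. The text does not specify a header format, so the header is modeled simply as the list of remaining ToR hops.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: str
    route: list                                 # remaining ToR hops, ending at the destination
    trace: list = field(default_factory=list)   # ToRs visited so far


def encapsulate(payload, path):
    """Source ToR: write the sequence of ToRs after the source into the header."""
    return Packet(payload, list(path[1:]), [path[0]])


def deliver(packet, links):
    """Forward hop by hop; each ToR reads the next hop from the header.

    links: set of frozenset pairs, one per direct optical circuit.
    """
    current = packet.trace[0]
    while packet.route:
        nxt = packet.route.pop(0)
        if frozenset((current, nxt)) not in links:
            raise ValueError(f"no circuit between {current} and {nxt}")
        packet.trace.append(nxt)
        current = nxt
    return current


# T1 has no direct circuit to T7, so the packet transits T3.
links = {frozenset(("T1", "T3")), frozenset(("T3", "T7"))}
pkt = encapsulate("hello", ["T1", "T3", "T7"])
final_tor = deliver(pkt, links)
```

Because the full ToR sequence travels in the header, intermediate ToRs need only a local forwarding entry for each directly connected neighbor.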

The all-optical network described herein can be easily supplemented with other forms of network connectivity including wireless and electrical networks.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention covers the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

1. A method for interconnecting a data center network, said method comprising: using hop-by-hop routing over an optical network; and dynamically constructing a network topology.
2. The method of claim 1, comprising receiving a traffic matrix to create on-demand the network topology.
3. The method of claim 2, comprising applying a Greedy-Tree heuristic.
4. The method of claim 3, comprising determining a total bandwidth demand across two virtual-nodes by summing demands from the real nodes in each virtual node to the other.
5. The method of claim 4, comprising determining: $\mathrm{PairDemand}\left(\upsilon N_{i}^{q}, \upsilon N_{j}^{q}\right) = \sum_{a \in \upsilon N_{i}^{q},\, b \in \upsilon N_{j}^{q}} \left(D_{ab} + D_{ba}\right)$.
6. The method of claim 4, wherein pairwise demands are used as weights for standard matching to obtain the best set of virtual-edges.
7. The method of claim 4, wherein each virtual edge can have one or more real edges and a number of wavelengths.
8. The method of claim 4, wherein edges and wavelengths are determined by matching restricted to only sets of nodes in two virtual-nodes being connected.
9. The method of claim 2, comprising applying a Darwinian heuristic.
10. The method of claim 9, comprising localizing high-volume flows over direct circuit links.
11. The method of claim 9, comprising performing an n-matching technique to determine which pairs of nodes should be connected to form a higher level virtual node and generating graph connectivity using edge-exchange operations.
12. The method of claim 11, wherein connectivity is ensured using the edge-exchange operation on edges of lowest weight across pairs of components.
13. The method of claim 9, comprising performing weighted matching restricted to a degree of N (i.e., weighted N-matching), where N is the number of connections to other top-of-racks (ToRs).
14. The method of claim 1, comprising applying the multi-hop routing to form an optimal network topology that maximally serves overall network traffic demand.
15. The method of claim 14, wherein the multi-hop routing comprises source-routing.
16. The method of claim 14, comprising determining and sending a suitable configuration and paths to ToRs.
17. The method of claim 1, wherein each packet from a server destined to a server outside the ToR is tunneled through a source-routing protocol between ToRs.
18. The method of claim 1, comprising specifying a sequence of destination ToRs in a header by a source ToR, and sending to a first ToR through a local forwarding table.
19. The method of claim 1, wherein a first hop looks at a subsequent hop in sequence and sends the packet to the subsequent hop.
20. The method of claim 1, comprising routing data over a supplementary electrical network or wireless network.
21. A method for interconnecting a data center network, said method comprising using hop-by-hop routing over an optical network.
22. A method for interconnecting a data center network, said method comprising using hop-by-hop routing over an optical network; and using bidirectional optical network devices to enable bidirectional communication over fiber.
23. The method of claim 22, comprising dynamically constructing a network topology.
24. A method to communicate over an optical network, comprising dynamically constructing a network topology based on traffic demands and hop-by-hop routing; and constructing a dynamically changing data center network (DCN) architecture.